Module Communication in Experimental Applications

A data transport is the provision of data in a desired target system that is different from the system in which the data originated. This transport can be implemented using different methods, for example by storing the data on a hard disk that can be accessed by the different systems involved.

For engineers the previously described method is fairly easy to implement, and it provides an additional safety margin because the data is cached separately: in the uncommon event of a program crash, the data is still available. However, this variant requires an additional hardware component that paralyzes the entire structure in case of a failure. For high frequency applications it must also be ensured that the hard disk provides a sufficient data transfer rate. When recording measured values within a file, the intermediate state must be saved after each data iteration, and the processing speed during opening, saving and closing can likewise become the limiting factor of the system. Since these work steps must follow one another, the work cannot be spread across several threads. For the reasons mentioned above, one should always consider carefully whether the described method of data transport is suitable for the individual application.

Another method of data transport makes it possible to dispense with the use of a hard disk. This involves direct communication between the data generating system and the data processing system.

A disadvantage of this approach is a possible loss of data if one of the subsystems crashes. With the help of advanced methods, however, such data loss can be prevented and this disadvantage can be compensated for. Direct communication between the different systems requires a more complex architecture of the system and the software, but several considerable advantages follow. For example, the system no longer contains a system relevant hard disk that represents a single point of failure. Furthermore, the frequency of data generation is no longer limited by memory speed, since the data is not cached on disk. The processing speed is also no longer a limiting factor, since further data processing can be divided among any number of threads. The only remaining limiting factor is therefore the maximum transfer speed of the transfer technology used. In conclusion, this means that much faster processes can be implemented with the help of direct data transmission.

If one decides to implement direct communication between two communication partners, further challenges arise that do not exist in the same form when caching on a hard disk. Both the sender and the receiver must be able to compose and read the messages correctly. If a communication partner cannot interpret a received message, the information contained in it is lost. In the case of erroneous data stored on a hard disk, it may still be possible to decode the data at a later date and thus recover it.

It is therefore of enormous importance that the sender and receiver transmit the data in a form that can be understood by both sides, even when different types of data occur within a message. In practice, for example, it happens that a measured value is transmitted in the form of a float, to which a time stamp in the form of a string also belongs. This illustrates the requirement to cover different data types within a single message.

Since data is also exchanged across different systems and interfaces, it is important to adopt an environment independent representation of the data. This representation must be compatible with the programming languages and operating systems used. It does not make sense to transfer whole files, similar to the files stored on hard disks, since these have too large a memory footprint for efficient data transfer. In this case, it makes much more sense to use a standardized serialization format. Such formats convert a wide variety of data into a single data type in a clearly defined form, often a string or binary data. On the basis of the well defined rules of these formats, the data can be converted in both directions between different programming languages and operating systems. This ultimately ensures that the resulting data can be sent efficiently.
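
As a simple illustration, the following minimal Python sketch shows how a measured value (float) and a time stamp (string) could be combined into one environment independent message using JSON; the field names and values are chosen freely for the example and are not taken from the test rig.

    # Minimal sketch: serializing a mixed-type message (measured value plus
    # time stamp) into a language- and OS-independent form using JSON.
    import json
    from datetime import datetime, timezone

    measurement = {
        "value": 23.7,                                        # measured value as float
        "timestamp": datetime.now(timezone.utc).isoformat(),  # time stamp as string
    }

    # Sender side: convert the data structure into a JSON string for transfer.
    payload = json.dumps(measurement)

    # Receiver side: reconstruct the original data types from the JSON string.
    decoded = json.loads(payload)
    print(decoded["value"], decoded["timestamp"])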

It becomes clear that direct data transfer between different modules is possible with the methods mentioned. This makes it feasible to implement a system that does not require the additional hardware component of a hard disk, so certain risks can be avoided. However, these advantages are associated with an increased implementation effort, since direct communication must be established between the communication partners and this communication must also work in a fail safe manner. Furthermore, the implementation of the serialization formats can involve considerable effort to ensure that the communication partners understand each other without loss.

There is a large number of software solutions for the described tasks of data transfer and data serialization. It is therefore necessary to select a suitable technology for the specific application that is to be implemented via direct data transfer between different modules.

Data Transmission in Experimental Applications

A number of network protocols are available for the transport of data packets, which are grouped under the collective term TCP/IP. This protocol family is hardware and software independent and thus offers a uniform standard for network communication, enabling data to be exchanged within the network. It is therefore possible to distribute data packets directly via these network protocols. Besides using the raw TCP/IP network protocols, it is also possible to build on extended frameworks, which allows tailor made communication solutions to be implemented more effectively.

Above all, such extended frameworks make it simpler to scale a system to several interfaces. Using RPC technology (Remote Procedure Call), a client-server model can be implemented, in which the client can request the server to complete tasks or to send data.

A well known technology in this context is gRPC, an environment independent RPC framework that provides various functions for monitoring and task distribution. It is possible to perform load balancing, i.e. to distribute tasks given by a client to several servers, also called workers, or to perform health checking of the different communication participants.

Another solution is MQTT, an extremely lightweight publish-subscribe protocol designed for constrained devices and for networks with high latency, low bandwidth, or a high failure rate. When errors occur, MQTT allows transmission processes to be resumed rapidly.

A further communication interface is ZeroMQ. It falls into the class of lightweight communication libraries that extend standard sockets with numerous functions. Special features are that asynchronous message queues can be implemented, as can reliable, low maintenance structures built from lightweight code. For the implementation of the task in the course of this master's thesis, it was decided to use the library ZeroMQ.

Transmitting Experimental Data using ZeroMQ

ZeroMQ is an open source library that enables asynchronous data exchange between distributed systems within a network. It was developed for use in concurrent systems. The data is sent as strings via sockets. Different transport methods can be used for unicast as well as for multicast.

A special feature of ZeroMQ is that it does not require a message broker. This means that the strings are exchanged directly between the applications involved. However, local data exchange between different processes can also take place via ZeroMQ. A major advantage of the open source library is its platform independence.

Common operating systems such as Windows, Linux or macOS are supported. Through a variety of APIs, different programming languages such as C, C++, Go, Java, Python and many more are supported as well.

Socket API

Compared to traditional socket programming, the complexity of ZeroMQ is reduced to a minimum: the socket API takes care of routing, framing, establishing and terminating connections, as well as restoring an interrupted connection.

The APIs are designed so that they can be used like traditional Berkeley sockets. Table 1 lists the named sockets and their functions.

Socket Function
zmq_socket() Creating a socket
zmq_close() Destroying a socket
zmq_setsockopt() Setting options on the socket
zmq_getsockopt() Getting options of the socket
zmq_bind() Plugging sockets into the network topology
zmq_connect() Plugging sockets into the network topology
zmq_send() Unipart message: write data to socket
zmq_recv() Unipart message: read data from socket
zmq_poll() Multiplexing input/output events
zmq_msg_init() Multipart message: initialize an empty message
zmq_msg_init_size() Multipart message: initialize with a given size
zmq_msg_init_data() Multipart message: initialize with given data
zmq_msg_send() Multipart message: write data to socket
zmq_msg_recv() Multipart message: read data from socket
zmq_msg_close() Multipart message: release the message
zmq_msg_data() Multipart message: access the data
zmq_msg_size() Multipart message: access the size of the data
zmq_msg_more() Multipart message: check whether further parts follow
zmq_msg_set() Multipart message: set message properties
zmq_msg_get() Multipart message: read out message properties
zmq_msg_copy() Multipart message: copy data
zmq_msg_move() Multipart message: move data

Table 1: ZeroMQ Sockets and their Functions

Before a socket can be created, a ZeroMQ context must be set up. This is a container that holds all the ZeroMQ sockets created in the process. A connection between two endpoints can then be established. For this purpose, a node is bound (zmq_bind()), which can then be accessed by another node (zmq_connect()). It makes no difference whether the transmitter or the receiver side calls zmq_bind() or zmq_connect(). The nodes can be network nodes, processes or threads. As already mentioned above, ZeroMQ sockets automatically reestablish a lost connection. This means that nodes can come and go at any time and the order in which the different interfaces are started is arbitrary. This possibility does not exist with traditional Berkeley sockets.
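
A minimal pyzmq sketch of these steps might look as follows; the socket types (discussed in the messaging pattern sections below), the endpoint and the port are assumptions chosen for the example, and the Python calls correspond to the C functions zmq_bind() and zmq_connect() from Table 1.

    import zmq

    context = zmq.Context()                    # container for all sockets of this process

    responder = context.socket(zmq.REP)        # this node is bound to an endpoint ...
    responder.bind("tcp://*:5555")

    requester = context.socket(zmq.REQ)        # ... and can then be accessed by another node
    requester.connect("tcp://localhost:5555")

    # The order is arbitrary: calling connect() before bind() would work as well,
    # since ZeroMQ establishes (and re-establishes) the connection in the background.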

The communication between the sender and the receiver is encoded using ZMTP (ZeroMQ Message Transport Protocol), a binary encoding. A serialization of the data does not take place automatically, but can be realized in a few simple steps by the programmer using a freely selectable format (for example XML, JSON, […]). More detailed information about the serialization of the data in this test rig can be found in chapter XY.

The framing of ZMTP is very simple and consists of a header and a body. The header contains the length of the message to be transmitted and the body contains the entire message as a string. Figure 7 shows one possible example of the message structure described above.

ZeroMQ allows messages to be transmitted using two different APIs, depending on whether they are unipart or multipart messages. The simpler of the two uses only two functions, one for sending (zmq_send()) and one for receiving (zmq_recv()) the message. With this method, the message length is strictly limited by the buffer size, because the message is not stored separately. Therefore, it is recommended to use this method only for short messages that are not expected to vary in length.

The second API, which can also handle multipart messages, is more complex to use but offers advanced ways of working with the data. The messages are stored in memory in the zmq_msg_t structure, so a single transfer can contain multiple message parts that were written individually.
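
The following sketch shows the idea of a multipart message using the Python binding pyzmq, whose send_multipart() and recv_multipart() calls correspond to the zmq_msg_* functions of Table 1; the socket types, the inproc endpoint and the frame contents are chosen freely for illustration.

    import zmq

    context = zmq.Context()

    receiver = context.socket(zmq.PULL)
    receiver.bind("inproc://multipart-demo")   # inproc endpoints must be bound before connecting

    sender = context.socket(zmq.PUSH)
    sender.connect("inproc://multipart-demo")

    # One logical message consisting of three frames (parts).
    sender.send_multipart([b"sensor-1", b"23.7", b"2024-01-01T12:00:00Z"])

    parts = receiver.recv_multipart()          # returns all frames of the message at once
    print([p.decode() for p in parts])

    sender.close()
    receiver.close()
    context.term()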

Transport Protocol

ZeroMQ supports four different transport protocols, which are listed in Table 2.

Transport Protocol Connection String Usage
TCP tcp://hostname:port Network communication. Clients and servers can connect and disconnect at any time (disconnected TCP).
INPROC inproc://name Communication between threads (inter thread). Faster than TCP and IPC.
IPC ipc:///tmp/filename Communication between processes on the same host (inter process). Disconnected, like TCP.
(E)PGM (e)pgm://interface:address:port Multicast transport through the network. PGM for IP datagrams, EPGM for PGM encapsulated in UDP datagrams.

Table 2: ZeroMQ Transport Protocols

Another special feature is the independence of the socket APIs from the protocol used. Therefore, the application is structured in the same way, regardless of which protocol is used. If a change in the communication protocol is needed, it can be carried out without having to make major adjustments.

A socket can accept multiple incoming and outgoing connections. The combination of different addresses and protocols is also allowed. Figure 8 shows an example code of how this can be implemented.

In this example, the previously created ZeroMQ socket accepts connections on the address 127.0.0.1 with port 5555. Likewise, connections from all network interfaces are enabled on port 5000. In parallel, individual threads of a multithreaded application can be connected; via the inproc connection, data from the thread "worker_thread" is received.
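
Since the figure itself is not reproduced here, the following pyzmq sketch indicates what the described code might look like; the socket type (PULL) is an assumption, while the endpoints follow the description above.

    import zmq

    context = zmq.Context()
    collector = context.socket(zmq.PULL)

    collector.bind("tcp://127.0.0.1:5555")     # local connections on port 5555
    collector.bind("tcp://*:5000")             # all network interfaces on port 5000
    collector.bind("inproc://worker_thread")   # data from the thread "worker_thread"

    # collector.recv() now delivers messages arriving over any of these endpoints.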

Basic Messaging Patterns (using the Example of ZeroMQ)

In a distributed system, the applications have to be connected to each other so that communication can successfully take place. However, one also has to ensure that the respective roles of the communication participants are clearly defined.

ZeroMQ provides four different topologies of communication flow: Request – Reply, Publish – Subscribe, Parallel Pipeline, and Exclusive Pair. These classic socket combinations define the relationship between sender and receiver. Since working code structures of this kind are shared in computer science via different channels, for example literature or the internet, numerous design patterns (also simply called patterns) already exist for a wide variety of problems. These templates often only need to be adapted by the programmer to the specific use case. In the following sections, these built in ZeroMQ core patterns are discussed in more detail. They are based on the ZeroMQ documentation by Pieter Hintjens, which can be found here: https://zguide.zeromq.org/docs/chapter2/.

Request – Reply Pattern

The classic Req-Rep pattern [reference pattern page] consists of a client and a server and is the simplest way to use ZeroMQ.

A client sends the request (in this example “Hello”) and the server responds directly to the client (in this example “World”).
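
A minimal, self-contained sketch of this exchange in Python (pyzmq) could look as follows; the server runs in a background thread only so that both sides fit into one script, and port 5555 is an arbitrary choice.

    import threading
    import zmq

    context = zmq.Context()

    def server():
        rep = context.socket(zmq.REP)
        rep.bind("tcp://*:5555")
        request = rep.recv_string()            # blocks until a request arrives
        print("Server received:", request)
        rep.send_string("World")               # reply goes back to the same client
        rep.close()

    t = threading.Thread(target=server)
    t.start()

    client = context.socket(zmq.REQ)
    client.connect("tcp://localhost:5555")
    client.send_string("Hello")                # REQ blocks further sends until the reply arrives
    print("Client received:", client.recv_string())

    t.join()
    client.close()
    context.term()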

The communication is synchronous, which means the server and the client each block their process until a request or a response, respectively, has been received. The fact that client and server must be explicitly coupled with each other results in a strong connection.

However, as soon as one of the two sides fails, the entire system is blocked. In ZeroMQ, the request-reply topology is implemented using the REQ and REP socket pair. A REQ socket can connect to a number of REP sockets. When this happens, the client's requests are distributed equally among all connected servers (round robin).

An asynchronous request-reply communication can be realized with the help of a ROUTER-DEALER socket combination.

Publish – Subscribe Pattern

The Publish-Subscribe pattern connects a set of publishers with a set of subscribers. In other words, a publisher publishes data that is subscribed to by at least one subscriber. As long as the participants are active, the stream has no beginning and no end. As soon as new data is published, it is received by the subscribers. Participants can arrive, leave or drop out at any time without affecting the rest of the system.

Communication in this case can take place via a broker, i.e. a middleware that forwards the published data to the correct subscribers. This has the advantage that publishers and subscribers do not need to know about each other's existence; they only need to be connected to the broker. A disadvantage is that the broker can be a single point of failure, which can potentially block the entire system as soon as an error occurs.

With ZeroMQ, however, it is also possible to implement direct communication without a broker. This eliminates the risk of creating an interface that could potentially bring the entire system to a standstill in the event of an error.

With this communication topology, the data transfer runs only in one direction: data can only be sent from the PUB to the SUB socket. In the case of a simple publish-subscribe link it is not possible to check whether all the data sent out by the publisher arrived at the subscriber without losses, since the publisher has no knowledge of who the data is being sent to. There is therefore no strong connection.
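
The following pyzmq sketch illustrates such a broker-less publish-subscribe link; the port, the topic-free subscription and the short start-up pause (to let the subscription reach the publisher) are assumptions for this example.

    import time
    import zmq

    context = zmq.Context()

    publisher = context.socket(zmq.PUB)
    publisher.bind("tcp://*:5556")

    subscriber = context.socket(zmq.SUB)
    subscriber.connect("tcp://localhost:5556")
    subscriber.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to all published messages

    time.sleep(0.5)                            # give the subscription time to propagate

    publisher.send_string("temperature 23.7")  # data flows only from PUB to SUB
    print(subscriber.recv_string())

    publisher.close()
    subscriber.close()
    context.term()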

Parallel Pipeline Pattern

The pipeline pattern [reference pattern page] is intended for distributing tasks to different workers. In this way, large task packages can be executed in parallel and the computing time is thus reduced. Using PUSH sockets, the tasks are distributed to the workers, which receive them using PULL sockets. Once the tasks are completed, the resulting data is sent from the workers via PUSH sockets to the collector, which receives it using a PULL socket.
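
A compact sketch of this structure in pyzmq might look as follows; the inproc endpoints, the number of workers and the "work" (squaring a number) are placeholders, and the short pause merely gives both workers time to connect before tasks are distributed.

    import threading
    import time
    import zmq

    context = zmq.Context()

    def worker():
        tasks = context.socket(zmq.PULL)       # receive tasks from the distributor
        tasks.connect("inproc://tasks")
        results = context.socket(zmq.PUSH)     # send results on to the collector
        results.connect("inproc://results")
        for _ in range(2):                     # each worker handles two of the four tasks
            number = int(tasks.recv_string())
            results.send_string(str(number * number))
        tasks.close()
        results.close()

    distributor = context.socket(zmq.PUSH)     # fans tasks out to the workers
    distributor.bind("inproc://tasks")
    collector = context.socket(zmq.PULL)       # gathers the results
    collector.bind("inproc://results")

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    time.sleep(0.2)                            # let both workers connect before distributing

    for number in range(4):                    # tasks are round-robined over the workers
        distributor.send_string(str(number))

    for _ in range(4):
        print(collector.recv_string())

    for t in threads:
        t.join()
    distributor.close()
    collector.close()
    context.term()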

Exclusive Pair Pattern

The Exclusive Pair pattern uses PAIR sockets to exclusively pair two nodes together. Only one connection is possible at a time. These sockets are not intended for use across networks, but for communication between two threads. The communication is bidirectional.
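
A minimal sketch of two threads coupled by PAIR sockets might look as follows; the inproc endpoint name and the exchanged messages are arbitrary.

    import threading
    import zmq

    context = zmq.Context()

    def peer():
        pair = context.socket(zmq.PAIR)
        pair.connect("inproc://pair-demo")
        print("peer received:", pair.recv_string())
        pair.send_string("pong")               # communication works in both directions
        pair.close()

    node = context.socket(zmq.PAIR)
    node.bind("inproc://pair-demo")

    t = threading.Thread(target=peer)
    t.start()

    node.send_string("ping")
    print("node received:", node.recv_string())

    t.join()
    node.close()
    context.term()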

Reliable Request – Reply Design Patterns (using the Example of ZeroMQ)

In many applications, and especially in experiments, it is particularly important not to lose data. Applications can freeze or crash unexpectedly, for example the broker used, which corresponds to a single point of failure. There are also other sources of data loss. For example, if the received data is processed too slowly, the queue can overflow and messages are lost. The network connection can also be lost temporarily for a variety of reasons. Although ZeroMQ sockets reconnect on their own once they are working again, messages sent in the meantime can be lost. Depending on the use case, it can therefore be particularly important to design reliable communication patterns that minimize the risk of data loss due to such incidents.

Classical request-reply patterns constantly wait for the response of the socket partner. As soon as one of the sockets crashes or something goes wrong during the transfer, the program is blocked and has to be restarted. To prevent freezing, some ways to build more reliable request-reply patterns are listed below. Each new pattern extends the previous one. The complexity of the programs thus increases with each further pattern, but so do the reliability and the possibilities.

The request-reply patterns are divided into broker based systems and systems without brokers. Among the broker based systems, the Majordomo pattern is a very powerful variant that can offer high reliability and strong performance at the same time. The same applies to the Freelance pattern for systems without a broker. Example programs for these can be found in the ZeroMQ documentation. SOURCE The code is provided in the programming languages C, Haxe, Java, Python, Ruby and Tcl.

Client-Side Reliability (Lazy Pirate Pattern)

A simple variant of a reliable request-reply solution is the so called Lazy Pirate pattern.

Here only the client, i.e. the requester, is adapted. This is reasonable because it is usually the client that is in danger of freezing while it waits for the response to a request it has sent out.

To work around the described problem, the Lazy Pirate pattern adds three functionalities. To prevent the zmq_recv() function from blocking the application, the zmq_poll() function is used first to check whether a response has arrived; only if this is the case is the command to receive the data executed. Furthermore, the zmq_poll() function is given a timeout, which limits the wait for a response. If no response arrives within this time, the request is sent out again. This process is also limited to a predefined number of retries, after which the attempt is aborted and a notification is issued that the server is unreachable.

By implementing these functions, the client is no longer expected to freeze. If the client does not receive a response, it repeats the request to the server until the maximum number of attempts is reached. With the help of this pattern the program is no longer blocked when there is no response from the server; in other words, it becomes possible to react to the underlying problem.
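
A sketch of this client-side logic with pyzmq could look as follows; the endpoint, the timeout of 2500 ms and the limit of three retries are illustrative values, not those of the test rig.

    import zmq

    SERVER_ENDPOINT = "tcp://localhost:5555"
    REQUEST_TIMEOUT_MS = 2500
    MAX_RETRIES = 3

    context = zmq.Context()

    def lazy_pirate_request(payload):
        for attempt in range(1, MAX_RETRIES + 1):
            client = context.socket(zmq.REQ)
            client.connect(SERVER_ENDPOINT)
            client.send(payload)

            poller = zmq.Poller()
            poller.register(client, zmq.POLLIN)    # poll instead of blocking on recv()
            if poller.poll(REQUEST_TIMEOUT_MS):    # only receive if a reply is waiting
                reply = client.recv()
                client.close()
                return reply

            client.setsockopt(zmq.LINGER, 0)       # discard the stale socket and retry
            client.close()
            print("no reply, retry", attempt, "of", MAX_RETRIES)

        print("server unreachable, giving up")
        return None

    print(lazy_pirate_request(b"request"))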

Basic Reliable Queuing (Simple Pirate Pattern)

The Simple Pirate pattern [reference pattern page] extends the Lazy Pirate pattern with a functionality that is particularly useful for applications in which several servers are present, for example in the form of workers that are assigned specific tasks. When a worker is assigned a task, it completes it and then returns a signal that it is ready to take on a new task. The Simple Pirate pattern has a broker, which forwards the requests of the clients to available workers. Because clients and workers do not know about each other's existence, any number of workers and clients can be realized without problems. This makes it simple to extend the pattern to bigger systems.

The client's retry mechanism, previously described in the Lazy Pirate pattern, is implemented in the broker, causing it to send the request to the workers again if no response to a previous request was received.

The main advantage of this pattern is that the request-reply structure can be extended to numerous clients and workers without great effort. However, there is the risk that a failure of the broker brings the entire program to a standstill. This means that the broker is a single point of failure, a dependence that should be avoided if possible.

When the broker is restarted after a crash, the program will not work either, because the signal sent out by the workers saying that they are ready to accept a new task was not received by the crashed broker. Thus, the broker has no knowledge about ready workers and cannot distribute tasks. As a consequence, all existing workers must also be restarted. To make the Simple Pirate pattern even more reliable, the individual programs can be supplemented by a so called heartbeat, which is explained in the following section.

Heartbeating

The heartbeat of a program indicates whether the program is reachable or not. This functionality emits a signal with each iteration of a loop. The receiver of the heartbeat thus knows about the status of the program, for example whether it has frozen or crashed. There are different implementations of the heartbeat.

One – Way Heartbeat

The one way heartbeat consists of a simple, constantly recurring message, which is sent from a node to the associated peer. If the peer does not receive a heartbeat from the node for a certain time, the node is declared dead. This heartbeat mechanism is the only one that can be used for pub-sub applications, since there is only a one way flow of information. An advantage is that the publisher can send a heartbeat every second, for example; this gives the subscriber the necessary information about the publisher's status even if no other data is published. In general it should be noted that in the case of large data transfers the heartbeat might arrive late at the peer. If the timeout is chosen incorrectly, a node can be declared dead even though it is still alive.

Ping – Pong Heartbeat

The so called Ping-Pong heartbeat is used for all applications that allow bidirectional communication, such as request-reply. Here the heartbeat involves the node and the peer simultaneously: as soon as one participant sends a heartbeat (ping), the counterpart responds with a heartbeat (pong).
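
The following sketch illustrates the ping-pong idea between two threads; the heartbeat interval, the timeout and the fixed number of three iterations are arbitrary choices for the example.

    import threading
    import time
    import zmq

    HEARTBEAT_INTERVAL_S = 1.0
    HEARTBEAT_TIMEOUT_MS = 2000

    context = zmq.Context()

    def peer():
        sock = context.socket(zmq.PAIR)
        sock.connect("inproc://heartbeat")
        for _ in range(3):
            if sock.recv() == b"ping":         # answer every ping with a pong
                sock.send(b"pong")
        sock.close()

    node = context.socket(zmq.PAIR)
    node.bind("inproc://heartbeat")
    t = threading.Thread(target=peer)
    t.start()

    poller = zmq.Poller()
    poller.register(node, zmq.POLLIN)

    for _ in range(3):
        node.send(b"ping")
        if poller.poll(HEARTBEAT_TIMEOUT_MS):  # pong received within the timeout
            node.recv()
            print("peer alive")
        else:
            print("peer declared dead")        # no pong: the peer is treated as dead
            break
        time.sleep(HEARTBEAT_INTERVAL_S)

    t.join()
    node.close()
    context.term()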

Robust Reliable Queuing (Paranoid Pirate Pattern)

Another design pattern, the so called Paranoid Pirate pattern [reference pattern page], extends the previously mentioned Simple Pirate pattern by the Ping-Pong heartbeat function described above.

The information about the availability of the participants is exchanged between the broker and the individual workers. The broker therefore knows at any time which workers are still active and refrains from sending requests to failed workers. Additionally, the workers know whether the broker is still active. In case the broker has crashed and is restarted, the workers can signal again that they are ready to accept work. Furthermore, in the Paranoid Pirate pattern the worker is connected with a DEALER socket instead of a REQ socket. The DEALER socket has the advantage that messages can be sent and received at any time, whereas with the REQ socket the sequence of sending and receiving is fixed.

Service-Oriented Reliable Queuing (Majordomo Pattern)

The Majordomo pattern [reference pattern page] is a modification of the Paranoid Pirate pattern into a fully service oriented broker. The general idea is that the clients specify the requested task by type. The broker then assigns this task to a specialized worker with the corresponding capabilities.

It is necessary to create a protocol that defines how the workers and clients must be structured and set up in order to connect successfully to the broker.

Furthermore, to guarantee that the heartbeats work as intended, the workers need to be single threaded, i.e. they execute their tasks sequentially and do not run parallel tasks. This guarantees that no heartbeat is sent out if the main task of the respective program is frozen. The broker, on the other hand, can be extended to several threads, which is useful if the number of clients and workers is particularly large. Each thread then manages one set of clients as well as one set of workers.

A disadvantage of the Majordomo pattern is that the data management is not designed for performance. A test run of the pattern shows that 100,000 requests require a turnaround time of 14 seconds. This slow turnaround is caused by two factors. On the one hand, data frames are copied and forwarded several times within the broker, which unnecessarily consumes main memory. On the other hand, the synchronous procedure of the request-reply process contributes to the poor performance: with each request started, the program waits for a response and only moves on to the next one when the response has been received or after repeated failure.

Asynchronous Majordomo Pattern

In the Asynchronous Majordomo pattern [reference pattern page], the poor performance noted for the Majordomo pattern is improved by introducing asynchronous processes. Here, the zmq.send() method is executed within its own loop. Thus, the requests are distributed to the individual workers much faster than before, since a new request no longer has to wait for the response to the previous one before being sent out. Inside another loop is the zmq.recv() method, which receives the expected responses while respecting the configured timeouts. With the help of these small adjustments, the Asynchronous Majordomo pattern can reduce the turnaround time for the 100,000 requests to roughly a quarter compared to the Majordomo pattern. In this constellation, 25,000 request-reply executions per second are feasible. Considering the compact code of this pattern, this performance is very good.
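
The following simplified pyzmq sketch shows only the asynchronous core idea, not the full Majordomo protocol: a DEALER socket sends all requests in one loop and collects the replies in a second loop. The echo server, the port and the number of ten requests exist only to make the example self-contained.

    import threading
    import zmq

    context = zmq.Context()

    def echo_server():
        rep = context.socket(zmq.REP)
        rep.bind("tcp://*:5560")
        for _ in range(10):
            rep.send(rep.recv())               # reply with the request itself
        rep.close()

    t = threading.Thread(target=echo_server)
    t.start()

    client = context.socket(zmq.DEALER)
    client.connect("tcp://localhost:5560")

    for i in range(10):                        # first loop: send all requests without waiting
        client.send_multipart([b"", b"request %d" % i])

    for _ in range(10):                        # second loop: collect the replies as they arrive
        _empty, reply = client.recv_multipart()
        print(reply.decode())

    t.join()
    client.close()
    context.term()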

However, there is still room for improvement in the handling of data within the broker. Furthermore, the restart of the broker, which as already mentioned is still a single point of failure, is not implemented and thus represents the greatest weakness of the described design pattern. Eliminating these vulnerabilities requires a deep extension of this pattern, but it is necessary for larger architectures such as a web service with thousands of users.

Disconnected Reliability (Titanic Pattern)

The Titanic pattern [reference pattern page] extends the discussed Asynchronous Majordomo pattern with the possibility of storing data temporarily on a hard disk. Although this adds another component to the system that can fail, and moves it further away from compact code, this extension offers a major advantage: with the addition of hardware storage, it is possible for the client to go offline or perform other tasks between the request and the response. In the previous design patterns, the client always had to wait for the response in real time.

If a client makes an urgent request that needs to be processed as quickly as possible, the broker forwards it directly to the next available worker that is capable of performing this task. However, if the request is not time critical and a short delay in the response does not pose a problem for the system, the request can be passed on to the program that is referred to as Titanic in the figure. The request is first stored immediately on the hard disk with a unique ID, which is composed of the request and the client. As soon as a worker is available and not occupied by an urgent request from another client, the Titanic program forwards the request to this worker. As soon as the response is available, the Titanic program writes it back to disk.

After an unspecified time, the client can query Titanic to see if the response to the submitted request is already available. If this is not the case, the client receives this information. But if the answer is available, the Titanic program takes it directly from the hard disk and delivers it to the client.

This procedure has several advantages. First of all, the client has the opportunity to complete other tasks while its request is being processed. In addition, in the event of a crash of Titanic, the broker or a worker, less data is lost, because a large part of it is temporarily stored on the hard disk. Since the Titanic program runs as a multithreaded application, it can serve numerous clients and workers simultaneously.

This design pattern is also a solution that is not designed for optimal performance. For example, all requests and responses that are stored on disk are stored in individual files, although dealing with one large file is more efficient than dealing with numerous small ones. This should therefore be adjusted if the performance is to be increased. When it comes to speed, a solid state drive is preferable to a classic disk drive. Another gain in reliability comes from extending the design pattern to use multiple disks; then the failure of one disk does not cause a large loss of data. If the user wants to make the design pattern very performant, the hard disk can be dropped completely and the data stored in temporary memory, but this sacrifices the big advantage of reliability against data loss in case of crashing programs.

In conclusion, the direction in which the described design patterns are developed further in order to represent an optimal solution depends completely on the application.

High-Availability Pair (Binary Star Pattern)

The Binary Star pattern [reference pattern page] is a constellation that achieves a high degree of reliability by providing a backup server for the primary server. In the event of a crash of the primary server, the secondary server is activated after a short time and takes over the tasks that the primary server would have processed in the normal case.

The primary and secondary servers consist of exactly the same code. Their roles are determined when the programs are executed: the server that is activated first receives the requests and takes the role of the primary server.

Because the primary and secondary servers are in direct contact and check each other, they know each other's status. Consequently, the secondary server detects when the primary server becomes unreachable. The clients have the addresses of both servers. They detect whether the primary server is active based on the heartbeat described above; if it is not, they automatically connect to the passive server. After the secondary server has started up, the clients can continue their regular operation without any restrictions. The Binary Star pattern is thus able to effectively intercept catastrophic system failures caused by hardware faults. A scheduled switch between the two servers can also be performed, for example for maintenance cycles.

An automatic switch back to the primary server once it is operational again would be easy to implement, but is not provided for within this pattern. Instead, the administrator has to decide when it is the right time to switch the servers back.

It needs to be mentioned that the secondary server usually does not perform any work unless the primary server fails. Consequently, the permanent operation of both servers generates additional costs. One must be willing to pay these costs for the gain in reliability if one wants to use the Binary Star pattern.

Brokerless Reliability (Freelance Pattern)

As already mentioned, a major advantage of ZeroMQ is the ability to implement peer to peer communication without a broker and thus without a single point of failure. However, setting up a reliable peer to peer architecture requires greater programming effort. The Freelance pattern [reference pattern page] represents a peer to peer solution that does not need the broker of the previously developed reliable design patterns.

In order to avoid passing the existing server addresses to each client via hard coding, the Freelance pattern uses a name service. The purpose of a name service is to map a logical name for each connection to the individual addresses, which it returns as an output value. Thus, the name service is the only application in the pattern where the addresses are hard coded. Clients can query the name service for the addresses of the servers they need to perform their tasks. Using the zmq_bind() function, the servers are bound to a fixed endpoint. Clients can then connect to the address of the desired server via one of three socket types and finally make a request. The sockets are either a REQ socket, a DEALER socket, or a ROUTER socket; the main differences are explained in the following.

When using a REQ socket in the sense of the Lazy Pirate pattern, only the familiar retry is attempted as soon as a connection cannot be established. This means that the client requests every available server at least once for each request. If there is a problem, the client waits for multiple timeouts before realizing that the request does not reach a valid server. For these reasons, this solution is not practical. Using the DEALER socket, the request is sent to all servers that are present in the network. Only the first response from a server is accepted by the client; all other responses are ignored. This ensures that the client gets the response to the request as quickly as possible. However, this process involves a very inefficient use of the servers, because they all process the request and cannot react to other requests in the meantime. Furthermore, this results in more network traffic than necessary. The main advantage of this implementation is its simplicity regarding signal processing.

The third option is using a ROUTER socket, which ultimately allows requests to be sent intelligently to specifically selected servers. However, it turns out to be very complex in design and depends heavily on the constraints of the task to be performed.

Slow Subscriber Detection (Suicidal Snail Pattern)

A well known problem with publish-subscribe structures is the so called slow subscriber. This means that the subscriber cannot process the data received from the publisher quickly enough. As a result, data accumulates at the subscriber over time, which can lead to delayed processing and ultimately causes the subscriber to crash. From the crash alone it can be hard to determine that the slow subscriber syndrome is the cause, so troubleshooting in this regard can be very difficult. For this reason, when a slow subscriber is detected, one wants to get it to shut down in a controlled manner. This task is performed by the Suicidal Snail pattern [reference pattern page]. This pattern equips the subscriber with an extension that assigns a timestamp to incoming data. If the subscriber detects that a time delay set by the programmer is exceeded while processing the data, the subscriber issues a notification that it cannot keep up with the publisher's speed and then terminates all its actions.

Thus, the programmer directly receives feedback as to why the program does not work reliably and can start fixing this problem.

High Speed Subscribers (Black Box Pattern)

One way to increase the subscriber's performance is addressed in the Black Box pattern [reference pattern page]. With a very busy publisher (for example, 1,000,000 messages per second), the subscriber will inevitably run into the slow subscriber syndrome if it wants to perform an action on every incoming message. In these cases, the Black Box pattern can be used.

In this design pattern, the actual work on the received data is not performed by the subscriber. After receiving the data, the subscriber forwards it directly to different workers, which then perform the data processing. This allows the work to be divided and performed more efficiently. The subscriber's job thus changes to that of a load balancer. The advantages of the individual transport protocols shown in Table 2 are also exploited: since the subscriber and the associated workers are located in multiple threads on the same computer, the communication can take place via INPROC in this case.

A further increase in efficiency can be achieved by designing the subscriber as a multithreaded application.

Reliable Pub-Sub (Clone Pattern)

The clone pattern aims to represent a reliable publish-subscribe pattern. A server publishes state updates, which are then received by several clients representing different applications. The state is sent out in the form of a key-value pair, e.g. a measured value and the corresponding identification key. The client stores this key-value pair in a hash table. A published state can thus affect the entries in different ways: it can represent a new entry, replace an old entry, or delete an entry from the hash table. The hash table is stored in temporary memory instead of in a database.
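
The following small sketch illustrates how a client might maintain its hash table from such key-value updates; the key names, the values and the convention that an empty value deletes an entry are assumptions made purely for illustration.

    hash_table = {}

    def apply_update(key, value):
        if value is None:                         # empty value: the entry is deleted
            hash_table.pop(key, None)
        else:                                     # new key inserts, existing key replaces
            hash_table[key] = value

    apply_update("sensor-1/temperature", 23.7)    # new entry
    apply_update("sensor-1/temperature", 24.1)    # replaces the old entry
    apply_update("sensor-1/pressure", 1.013)      # second entry
    apply_update("sensor-1/pressure", None)       # deletes the entry again
    print(hash_table)                             # {'sensor-1/temperature': 24.1}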

Dealing with the crash of a client

If a client crashes in this constellation, its hash table is lost and the client may no longer be able to work as required. To solve this problem, the server can also keep all published states in a hash table. [reference pattern page] In the case of a crash and a restart of the client, it subscribes to the stream again. After receiving the first data, the client can send a state request to the server and ask for the data that was published before the first newly received data. The server sends the affected section of the stored hash table to the client. The client then has access to the data and can resume its regular operation.

The server must be an application with at least two threads, where one thread publishes the states and the other thread processes state requests.

Changing data by clients

In some use cases, it is necessary that clients can make changes to the data after initially working with it. These changes must then be distributed to all other clients. To ensure that all clients have the same data sets, it is essential that the client pushes the changed data set to the server so that the server can send the update to all clients.

Dealing with many Clients

As the number of clients increases, the approach of sending all data to all clients becomes inefficient. Since the clients only need a part of the published data, it makes sense to divide the data into so called subtrees [reference pattern page]. Clients can then subscribe only to the corresponding subtree or, in the event of a crash, request only the data of that subtree with a state request. This selection of relevant data minimizes the memory requirements and increases the efficiency of data processing by the clients.

Prevention of saving Data that is not needed

The clone pattern uses a functionality called ephemeral values. When a data record exceeds the previously defined time to live (TTL), it is automatically deleted. The removal can be prevented by constantly refreshing the data record. [reference pattern page] This ensures that obsolete data is deleted and the memory does not fill up permanently.

The Binary Star Pattern for Pub-Sub Applications

The configuration of the publish-subscribe program described so far allows clients to crash and restart without any lasting data or performance losses. If the server crashes, operation can be resumed after a restart without additional problems, but the hash table of all data published up to the time of the crash is lost. If a client makes a state request after the server has been restarted, it will only receive the data published since the restart of the server.

To protect the program against data loss on the server side, the method of the Binary Star pattern, previously described in the chapter on reliable Request-Reply patterns, is used. For this to be successful, another (independent) server has to be created; the hash table is then stored on the primary and the secondary server simultaneously.

Updates from the clients in the Binary Star pattern [reference pattern page] are executed via publish-subscribe sockets instead of push-pull sockets. Thus, connecting both servers to the updates is leaner in terms of programming. If the primary server fails, the backup server can step in and continue the work in its place.